High Accuracy Tagging with Large Tagsets

Author

  • Dan Tufiş
Abstract

The paper presents experiments and results on morpho-syntactic (MS) tagging of a highly inflectional language, based on combining language models (LMs) learnt from multiple register-diversified corpora. To cope with a large tagset (614 tags), our underlying tagger uses a hidden, smaller tagset (92 tags), which is mapped back into the initial tagset after tagging proper. The same text is tagged in as many variants as there are language models available. The tag differences between these variants are reconciled by a combiner which, based on a specific combination method, chooses the winning tags. The combined version of the tagged text is then subject to the mapping process, which converts the tags of the reduced tagset into the more informative tags of the large tagset. The mapping is almost deterministic and uses a few contextual rules (regular expressions) to deal with the rare cases of mapping ambiguity.

1. Large tagsets and tiered tagging

Lexical ambiguity resolution is one of the most important tasks in natural language processing. It can be regarded as a classification problem: an ambiguous lexical item is one that can be classified differently in different contexts and, given a specific context, the disambiguator/classifier decides on the appropriate class. The features relevant for the classification task are encoded into the tags. The larger the tagset, the larger the training corpora needed [1]. To avoid severe data sparseness and accuracy degradation, a huge amount of manual work would be necessary to build appropriately large training corpora. We describe one possible way to avoid such an unrealistic solution. At a small price in tagging accuracy and practically no price in computational resources, it is possible to tag a text in terms of a large tagset by using LMs built for a reduced tagset, which consequently require much smaller training corpora and computational resources. We call this way of tagging tiered tagging [10, 11].
In general terms, tiered tagging uses a hidden tagset of smaller size (we call it the C-tagset; in our case 92 tags), on which an LM is built. This LM serves for a first level of tagging. Then, a post-processor deterministically replaces each tag from the small tagset with one or more (in our experiments, usually 2) tags from the large tagset (we call it the MSD-tagset), which contains 614 tags (MSDs). The words that become ambiguous after this replacement are, more often than not, the difficult cases in statistical disambiguation (such as determiners vs. possessive pronouns). They represent a small percentage (in our experiment, less than 6%) and are further processed by means of a few very simple regular-expression rules. Certainly, the reduced and the extended tagsets have to be in a specific relation (the C-tagset should subsume the MSD-tagset). In [10] it is shown how a C-tagset can be interactively designed from an MSD-tagset, based on a trial-and-error ID3-like procedure. The design considers the property of C-tagset recoverability, described below. We use the following notation: Wi represents a word, Ti represents a tag from the reduced tagset assigned to Wi, MSDk represents a tag from the large tagset, AMB(Wk) represents the ambiguity class of the word Wk in terms of MSDs (as encoded in the lexicon), MAP is a function that maps each Ti onto a subset of the MSD-set, and |X| represents the cardinality (number of elements) of the set X.

∀ Ti ∈ C-tagset, MAP(Ti) = {MSD1 ... MSDk} ⊂ MSD-set
∀ Wk ∈ Lex & AMB(Wk) = {MSDk1 ... MSDkn} ⊂ MSD-set ⇒
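As a rough illustration of the second tagging tier described above, the following Python sketch intersects MAP(Ti) with AMB(Wi) and falls back on a regular-expression rule when more than one MSD survives. Everything in it is an assumption made for illustration: the tag names, the toy lexicon, the rule format and the function name `recover_msds` are hypothetical and are not taken from the paper, and because the recoverability condition is cut off in this excerpt, the sketch does not claim to reproduce it exactly.

```python
import re

# Toy data: every tag name and lexicon entry below is hypothetical.
# MAP sends each reduced tag (C-tag) to the set of MSDs it stands for.
MAP = {
    "NOUN_SG": {"N.masc.sg", "N.fem.sg"},
    "VERB_3": {"V.pres.3sg", "V.pres.3pl"},
    "DET_POSS": {"Det.poss", "Pron.poss"},
}

# AMB gives each word's ambiguity class in terms of MSDs (the lexicon).
AMB = {
    "house": {"N.fem.sg", "V.pres.3pl"},
    "his": {"Det.poss", "Pron.poss"},
}

# Contextual rules: (C-tag, regex over the C-tags to the right, MSD chosen).
# Rules are tried in order; the empty pattern acts as a default.
RULES = [
    ("DET_POSS", re.compile(r"^NOUN_"), "Det.poss"),
    ("DET_POSS", re.compile(r""), "Pron.poss"),
]


def recover_msds(words, ctags):
    """Map a C-tagged sentence onto MSDs, one output tag per word."""
    output = []
    for i, (word, ctag) in enumerate(zip(words, ctags)):
        # Candidates are the MSDs allowed by both the C-tag and the lexicon.
        # Unknown words, or an empty intersection, fall back to MAP alone.
        candidates = (MAP[ctag] & AMB.get(word, MAP[ctag])) or MAP[ctag]
        if len(candidates) == 1:
            output.append(next(iter(candidates)))
            continue
        right_context = " ".join(ctags[i + 1:])
        for rule_ctag, pattern, msd in RULES:
            if rule_ctag == ctag and msd in candidates and pattern.search(right_context):
                output.append(msd)
                break
        else:
            # No rule applies: leave the ambiguity explicit.
            output.append("|".join(sorted(candidates)))
    return output


print(recover_msds(["his", "house"], ["DET_POSS", "NOUN_SG"]))  # ['Det.poss', 'N.fem.sg']
print(recover_msds(["his"], ["DET_POSS"]))                      # ['Pron.poss']
```

In this toy setting most words are resolved by the set intersection alone, and only the determiner/possessive-pronoun case reaches the rules, which mirrors the small share of residual ambiguity reported in the text.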

Similar resources

Tagset Mapping and Statistical Training Data Cleaning-up

The paper describes a general method (as well as its implementation and evaluation) for deriving mapping systems for different tagsets available in existing training corpora (gold standards) for a specific language. For each pair of corpora (tagged with different tagsets), one such mapping system is derived. This mapping system is then used to improve the tagging of each of the two corpora with...

Internal and external tagsets in part-of-speech tagging

We present an approach to statistical part-of-speech tagging that uses two different tagsets, one for its internal and one for its external representation. The internal tagset is used in the underlying Markov model, while the external tagset constitutes the output of the tagger. The internal tagset can be modified and optimized to increase tagging accuracy (with respect to the external tagset). We...

Using a Large Set of EAGLES-compliant Morpho-Syntactic Descriptors as a Tagset for Probabilistic Tagging

The paper presents one way of reconciling data sparseness with the requirement of high accuracy tagging in terms of fine-grained tagsets. For lexicon encoding, EAGLES elaborated a set of recommendations aimed at covering multilingual requirements and therefore resulted in a large number of features and possible values. Such an encoding, used for tagging purposes, would lead to very large tagset...

The paper describes a general method (as well as its implementation and evaluation) for deriving the mapping rules for the dif

The paper describes a general method (as well as its implementation and evaluation) for deriving mapping models for different tagsets available in existing training corpora (gold standards) for a specific language. These mapping models are further used to significantly improve the accuracy in the underlying training corpora and also for the assessment of the distributional adequacy of various t...

Dimensionality of dialogue act tagsets

This article compares one-dimensional and multi-dimensional dialogue act tagsets used for automatic labeling of utterances. The influence of tagset dimensionality on tagging accuracy is first discussed theoretically, then based on empirical data from human and automatic annotations of large scale resources, using four existing tagsets: DAMSL, SWBD-DAMSL, ICSI-MRDA and MALTUS. The Dominant Funct...

Publication date: 2008